Abstract

In this work we study the quantitative relation between VC-dimension and two
other basic parameters related to learning and teaching: the quality of
sample compression schemes and of teaching sets for classes of low VC-dimension.
Let C be a binary concept class of size m and VC-dimension d. Prior to this work,
the best known upper bounds for both parameters were log(m), while the best lower
bounds are linear in d. We present significantly better upper bounds on both as
follows. Set $k = O(d 2^d \log\log|C|)$.

We show that there always exists a concept c in C with a teaching set (i.e. a 
list of c-labeled examples uniquely identifying c in C ) of size k. This problem was 
studied by Kuhlmann (1999). Our construction implies that the recursive teaching 
(RT) dimension of C is at most k as well. The RT-dimension was suggested by 
Zilles et al. and Doliwa et al. (2010). The same notion (under the name partial-ID 
width) was independently studied by Wigderson and Yehudayoff (2013). An upper 
bound on this parameter that depends only on d is known just for the very simple 
case d = 1, and is open even for d = 2. We also make small progress towards this 
seemingly modest goal. 

We further construct sample compression schemes of size k for C, with additional
information of $k \log(k)$ bits. Roughly speaking, given any list of C-labelled examples
of arbitrary length, we can retain only k labelled examples in a way that allows
us to recover the labels of all other examples in the list, using an additional $k \log(k)$
information bits. This problem was first suggested by Littlestone and Warmuth
(1986).
1 Introduction


The study of mathematical foundations of learning and teaching has been very fruitful,
revealing fundamental connections to various other areas of mathematics, such as geometry,
topology, and combinatorics. Many key ideas and notions emerged from this study:
Vapnik and Chervonenkis’s VC-dimension [ 5], Valiant’s seminal definition of PAC learning
[I I], Littlestone and Warmuth’s sample compression schemes [32], Goldman and Kearns’s
teaching dimension [19], recursive teaching dimension (RT-dimension, for short) [48, 12, 40]
and more.

While it is known that some of these measures are tightly linked, the exact relationship 
between them is still not well understood. In particular, it is a long standing question 
whether the VC-dimension can be used to give a universal bound on the size of sample 
compression schemes, or on the RT-dimension. 

In this work, we make progress on these two questions. First, we prove that the
RT-dimension of a boolean concept class C having VC-dimension d is upper bounded by$^1$
$O(d 2^d \log\log|C|)$. Secondly, we give a sample compression scheme of size $O(d 2^d \log\log|C|)$
that uses additional information. Both results were subsequently improved to bounds that
are independent of the size of the concept class C [35, 9].

Our proofs are based on a similar technique of recursively applying Haussler’s Packing
Lemma on the dual class. This similarity provides another example of the informal
connection between sample compression schemes and RT-dimension. This connection also
appears in other works that study their relationship with the VC-dimension [ 2, 35, 9].

1.1 VC-dimension 

VC-dimension and size. A concept class over the universe X is a set $C \subseteq \{0,1\}^X$.
When X is finite, we denote |X| by n(C). The VC-dimension of C, denoted VC(C), is
the maximum size of a shattered subset of X, where a set $Y \subseteq X$ is shattered if for every
$Z \subseteq Y$ there is $c \in C$ so that c(x) = 1 for all $x \in Z$ and c(x) = 0 for all $x \in Y - Z$.
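The definition can be checked by brute force on small finite classes. The following is a minimal sketch, not part of the original text: it assumes a concept is stored as a tuple of 0/1 values indexed by the points of X, and the names `is_shattered` and `vc_dimension` are hypothetical helpers introduced only for illustration.

```python
from itertools import combinations

def is_shattered(concepts, subset):
    """True if every 0/1 pattern on `subset` is realized by some concept."""
    patterns = {tuple(c[x] for x in subset) for c in concepts}
    return len(patterns) == 2 ** len(subset)

def vc_dimension(concepts):
    """Brute-force VC-dimension of a nonempty finite class of equal-length 0/1 tuples."""
    concepts = list(concepts)
    n = len(concepts[0])
    best = 0  # the empty set is always shattered by a nonempty class
    for size in range(1, n + 1):
        if any(is_shattered(concepts, s) for s in combinations(range(n), size)):
            best = size
        else:
            break  # subsets of shattered sets are shattered, so we can stop
    return best
```

The search is exponential in n, so it is only meant for toy examples such as the class of singletons together with the empty set.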

The most basic result concerning VC-dimension is the Sauer-Shelah-Perles Lemma,
that upper bounds |C| in terms of n(C) and VC(C). It has been independently proved
several times, e.g. in [42].

Theorem 1.1 (Sauer-Shelah-Perles). Let C be a boolean concept class with VC-dimension
d. Then,

$|C| \le \sum_{j=0}^{d} \binom{n(C)}{j}$.

In particular, if $d \ge 2$ then $|C| \le n(C)^d$.
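As a quick numerical illustration (a sketch under the assumption that the lemma is used in its standard form, $|C| \le \sum_{j \le d} \binom{n}{j}$; the function name `sauer_bound` is hypothetical), the bound and its simplification to $n^d$ can be evaluated directly:

```python
from math import comb

def sauer_bound(n, d):
    """Standard Sauer-Shelah-Perles bound: sum of binomial(n, j) for j = 0..d."""
    return sum(comb(n, j) for j in range(d + 1))

# Compare the bound with the simplified estimate n**d for a few n >= d >= 2.
for d in (2, 3):
    for n in (d, 5, 10):
        print(d, n, sauer_bound(n, d), n ** d)
```

For example, `sauer_bound(4, 1)` equals 5, which is exactly the size of the class of singletons and the empty set over four points.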

1 In this text $O(f)$ means at most $\alpha f + \beta$ for constants $\alpha, \beta > 0$.
VC-dimension and PAC learning. The VC-dimension is one of the most basic
complexity measures for concept classes. It is perhaps mostly known in the context of the
PAC learning model. PAC learning was introduced in Valiant’s seminal work [ l] as a
theoretical model for learning from random examples drawn from an unknown distribution
(see the book [28] for more details).

A fundamental and well-known result of Blumer, Ehrenfeucht, Haussler, and
Warmuth [8], which is based on an earlier work of Vapnik and Chervonenkis [ 5], states that
PAC learning sample complexity is equivalent to VC-dimension. The proof of this theorem
uses Theorem 1.1 and an argument commonly known as double sampling (see Section A
in the appendix for a short and self-contained description of this well-known argument).

Theorem 1.2 ([ 5], [8]). Let X be a set and $C \subseteq \{0,1\}^X$ be a concept class of
VC-dimension d. Let $\mu$ be a distribution over X. Let $\epsilon, \delta > 0$ and m an integer satisfying
$2(2m+1)^d (1 - \epsilon/4)^m < \delta$. Let $c \in C$ and $Y = (x_1, \ldots, x_m)$ be a multiset of m independent
samples from $\mu$. Then, the probability that there is $c' \in C$ so that $c|_Y = c'|_Y$ but
$\mu(\{x : c(x) \neq c'(x)\}) > \epsilon$ is at most $\delta$.

VC-dimension and the metric structure. Another fundamental result in this area
is Haussler’s [23] description of the metric structure of concept classes with low
VC-dimension (see also the work of Dudley [ ]). Roughly, it says that a concept class C of
VC-dimension d, when thought of as an $L_1$ metric space, behaves like a d dimensional
space in the sense that the size of an $\epsilon$-separated set in C is at most $(1/\epsilon)^d$. More formally,
every probability distribution $\mu$ on X induces the (pseudo) metric

$\mathrm{dist}_\mu(c, c') = \mu(\{x : c(x) \neq c'(x)\})$

on C. A set $S \subseteq C$ is called $\epsilon$-separated with respect to $\mu$ if for every two concepts $c \neq c'$
in S we have $\mathrm{dist}_\mu(c, c') > \epsilon$. A set $A = A_\mu(C, \epsilon) \subseteq C$ is called an $\epsilon$-approximating set$^2$ for
C with respect to $\mu$ if it is a maximal $\epsilon$-separated set with respect to $\mu$. The maximality
of A implies that for every $c \in C$ there is some $r = r(c, \mu, C, \epsilon)$ in A so that r is
a good approximation to c, that is, $\mathrm{dist}_\mu(c, r) \le \epsilon$. We call r a rounding of c in A.
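For the uniform distribution on a finite domain, a maximal separated set can be built greedily. The sketch below is only an illustration under that assumption (concepts as 0/1 tuples, distance as normalized Hamming distance); the names `dist`, `greedy_approximating_set` and `rounding` are hypothetical.

```python
def dist(c1, c2):
    """Normalized Hamming distance, i.e. dist_mu for mu uniform on the domain."""
    return sum(a != b for a, b in zip(c1, c2)) / len(c1)

def greedy_approximating_set(concepts, eps):
    """Scan the concepts once, keeping those at distance > eps from all kept so far.

    The result is eps-separated, and it is maximal: every skipped concept was
    within distance <= eps of some kept concept.
    """
    chosen = []
    for c in concepts:
        if all(dist(c, a) > eps for a in chosen):
            chosen.append(c)
    return chosen

def rounding(c, approx):
    """A closest element of the approximating set, used as the rounding of c."""
    return min(approx, key=lambda a: dist(c, a))
```

The output depends on the scan order, which is why a fixed order is needed whenever two parties must agree on the same approximating set.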

An approximating set can be thought of as a metric approximation of the possibly
complicated concept class C, and for many practical purposes it is a good enough
substitute for C. Haussler proved that there are always small approximating sets.

Theorem 1.3 (Haussler). Let $C \subseteq \{0,1\}^X$ be a concept class with VC-dimension d. Let
$\mu$ be a distribution on X. Let $\epsilon \in (0,1]$. If S is $\epsilon$-separated with respect to $\mu$ then

$|S| \le e(d+1)\left(\frac{2e}{\epsilon}\right)^d \le \left(\frac{4e^2}{\epsilon}\right)^d$.

2 In metric spaces such a set is called an $\epsilon$-net, however in learning theory and combinatorial geometry
the term $\epsilon$-net has a different meaning, so we use $\epsilon$-approximating instead.


A proof of a weaker statement. For $m = 2\log(|S|)/\epsilon$, let $x_1, \ldots, x_m$ be independent
samples from $\mu$. For every $c \neq c'$ in S,

$\Pr(\forall i \in [m]\ \ c(x_i) = c'(x_i)) < (1 - \epsilon)^m \le e^{-m\epsilon} \le 1/|S|^2$.

The union bound implies that there is a choice of $Y \subseteq X$ of size $|Y| \le m$ so that
$|S|_Y| = |S|$. Theorem 1.1 implies $|S| \le (|Y| + 1)^d$. Thus, $|S| < (30 d \log(2d/\epsilon)/\epsilon)^d$.

□


1.2 Teaching 

Imagine a teacher that helps a student to learn a concept c by picking insightful examples. 
The concept c is known only to the teacher, but c belongs to a class of concepts C known 
to both the teacher and the student. The teacher carefully chooses a set of examples 
that is tailored for c, and then provides these examples to the student. Now, the student 
should be able to recover c from these examples. 

A central issue that is addressed in the design of mathematical teaching models is
“collusions.” Roughly speaking, a collusion occurs when the teacher and the student agree
in advance on some unnatural encoding of information about c using the bit description
of the chosen examples, instead of using attributes that separate c from other concepts.
Many mathematical models for teaching were suggested: Shinohara and Miyano [43],
Jackson and Tomkins [ 7], Goldman, Rivest and Schapire [.. ], Goldman and Kearns [19],
Goldman and Mathias [20], Angluin and Krikis [2], Balbach [5], and Kobayashi and
Shinohara [29]. We now discuss some of these models in more detail.

Teaching sets. The first mathematical models for teaching [19, 43, 3] handle collusions
in a fairly restrictive way, by requiring that the teacher provides a set of examples Y that
uniquely identifies c. Formally, this is captured by the notion of a teaching set, which was
independently introduced by Goldman and Kearns [19], Shinohara and Miyano [43] and
Anthony et al. [3]. A set $Y \subseteq X$ is a teaching set for c in C if for all $c' \neq c$ in C, we have
$c'|_Y \neq c|_Y$. The teaching complexity in these models is captured by the hardest concept
to teach, i.e., $\max_{c \in C} \min\{|Y| : Y \text{ is a teaching set for } c \text{ in } C\}$.

Teaching sets also appear in other areas of learning theory: Hanneke [ ] used them in
his study of the label complexity in active learning, and the authors of [47] used variants
of them to design efficient algorithms for learning distributions using imperfect data.
Defining the teaching complexity using the hardest concept is often too restrictive.
Consider for example the concept class consisting of all singletons and the empty set over
a domain X of size n. Its teaching complexity in these models is n, since the only teaching
set for the empty set is X. This is a fairly simple concept class that has the maximum
possible complexity.
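The singleton example can be verified by exhaustive search. The following sketch is illustrative only: it assumes concepts are distinct 0/1 tuples over the points 0, ..., n-1, and the helper names (`is_teaching_set`, `min_teaching_set`, `teaching_complexity`) are hypothetical.

```python
from itertools import combinations

def is_teaching_set(concepts, c, subset):
    """True if every other concept disagrees with c somewhere on `subset`."""
    return all(any(o[x] != c[x] for x in subset) for o in concepts if o != c)

def min_teaching_set(concepts, c):
    """A smallest teaching set for c, found by trying subsets in order of size."""
    n = len(c)
    for size in range(n + 1):
        for subset in combinations(range(n), size):
            if is_teaching_set(concepts, c, subset):
                return subset
    raise AssertionError("unreachable: the full domain separates distinct concepts")

def teaching_complexity(concepts):
    """Size of the smallest teaching set of the hardest concept in the class."""
    return max(len(min_teaching_set(concepts, c)) for c in concepts)
```

On the class of singletons and the empty set over n points, each singleton is taught by its own point, while the empty set needs all n points.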

Recursive teaching dimension. Goldman and Mathias [20] and Angluin and Krikis [2]
therefore suggested less restrictive teaching models, and more efficient teaching schemes
were indeed discovered in these models. One approach, studied by Zilles et al. [48], Doliwa
et al. [12], and Samei et al. [40], uses a natural hierarchy on the concept class C which is
defined as follows. The first layer in the hierarchy consists of all concepts whose teaching
set has minimal size. Then, these concepts are removed and the second layer consists
of all concepts whose teaching set with respect to the remaining concepts has minimal
size. Then, these concepts are removed and so on, until all concepts are removed. The
maximum size of a set that is chosen in this process is called the recursive teaching (RT)
dimension. One way of thinking about this model is that the teaching process satisfies an
Occam’s razor-type rule of preferring simpler concepts. For example, the concept class
consisting of singletons and the empty set, which was considered earlier, has recursive
teaching dimension 1: The first layer in the hierarchy consists of all singletons, which
have teaching sets of size 1. Once all singletons are removed, we are left with a concept
class of size 1, the concept class $\{\emptyset\}$, and in it the empty set has a teaching set of size 0.
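The layered process can be simulated directly on small classes. This is a brute-force sketch under the assumption that concepts are 0/1 tuples; `min_teaching_size` and `recursive_teaching_dimension` are hypothetical names, and the search is exponential in the domain size.

```python
from itertools import combinations

def min_teaching_size(concepts, c):
    """Size of a smallest set on which c differs from every other concept."""
    others = [o for o in concepts if o != c]
    n = len(c)
    for size in range(n + 1):
        for subset in combinations(range(n), size):
            if all(any(o[x] != c[x] for x in subset) for o in others):
                return size
    raise AssertionError("unreachable when concepts are distinct tuples")

def recursive_teaching_dimension(concepts):
    """Peel off layers of easiest-to-teach concepts; return the largest layer size."""
    remaining = list(set(concepts))
    rtd = 0
    while remaining:
        sizes = {c: min_teaching_size(remaining, c) for c in remaining}
        smallest = min(sizes.values())
        rtd = max(rtd, smallest)
        # remove the current layer and continue with the rest
        remaining = [c for c in remaining if sizes[c] > smallest]
    return rtd
```

For singletons plus the empty set this returns 1, matching the two-layer description above, whereas the full cube over two points has value 2.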

A similar notion to RT-dimension was independently suggested in [ ] under the
terminology of partial IDs. There the focus was on getting a simultaneous upper bound on
the size of the sets, as well as the number of layers in the recursion, and it was shown
that for any concept class C both can be made at most log |C|. Motivation for this study
comes from the population recovery learning problem defined in [15].

Previous results. Doliwa et al. [ ] and Zilles et al. [ ] asked whether small
VC-dimension implies small recursive teaching dimension. An equivalent question was asked
10 years earlier by Kuhlmann [30]. Since the VC-dimension does not increase when 
concepts are removed from the class, this question is equivalent to asking whether every 
class with small VC-dimension has some concept in it with a small teaching set. Given the 
semantics of the recursive teaching dimension and the VC-dimension, an interpretation 
of this question is whether exact teaching is not much harder than approximate learning 
(i.e., PAC learning). 

For infinite classes the answer to this question is negative. There is an infinite concept
class with VC-dimension 1 so that every concept in it does not have a finite teaching set.
An example for such a class is $C \subseteq \{0,1\}^{\mathbb{Q}}$ defined as $C = \{c_q : q \in \mathbb{Q}\}$ where $c_q$ is the
indicator function of all rational numbers that are smaller than q. The VC-dimension of
C is 1, but every teaching set for some $c_q \in C$ must contain a sequence of rationals that
converges to q.

For finite classes this question is open. However, in some special cases it is known
that the answer is affirmative. In [30] it is shown that if C has VC-dimension 1, then its
recursive teaching dimension is also 1. It is known that if C is a maximum$^3$ class then
its recursive teaching dimension is equal to its VC-dimension [12, 39]. Other families of
concept classes for which the recursive teaching dimension is at most the VC-dimension
are discussed in [ ]. In the other direction, [30] provided examples of concept classes
with VC-dimension d and recursive teaching dimension at least $\frac{3}{2} d$.

The only bound on the recursive teaching dimension for general classes was observed
by both [12, 47]. It states that the recursive teaching dimension of C is at most log |C|.
This bound follows from a simple halving argument which shows that for all C there exists
some $c \in C$ with a teaching set of size log |C|.
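One way to realize a halving argument of this kind is sketched below. It is an illustration rather than the cited authors' exact procedure: it assumes the concepts are distinct 0/1 tuples, and the function name `halving_teaching_set` is hypothetical. At each step we pick a coordinate on which the surviving concepts disagree and keep the smaller side, so the class at least halves.

```python
def halving_teaching_set(concepts):
    """Return (c, points) where `points` is a teaching set for c of size <= log2 |C|."""
    remaining = list(concepts)
    n = len(remaining[0])
    chosen = []
    while len(remaining) > 1:
        # a coordinate where the surviving (distinct) concepts take both values
        x = next(i for i in range(n) if len({c[i] for c in remaining}) == 2)
        ones = [c for c in remaining if c[x] == 1]
        zeros = [c for c in remaining if c[x] == 0]
        remaining = ones if len(ones) <= len(zeros) else zeros  # keep the minority side
        chosen.append(x)
    return remaining[0], chosen
```

Every concept discarded at some step disagrees with the surviving concept on the coordinate chosen at that step, which is why the chosen coordinates form a teaching set.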

Our contribution. Our first main result is the following general bound, which
exponentially improves over the log |C| bound when the VC-dimension is small (the proof is
given in Section 3).

Theorem 1.4 (RT-dimension). Let C be a concept class of VC-dimension d. Then there
exists $c \in C$ with a teaching set of size at most

$d 2^{d+3} (\log(4e^2) + \log\log|C|)$.

It follows that the recursive teaching dimension of concept classes of VC-dimension d
is at most $d 2^{d+3} (\log(4e^2) + \log\log|C|)$ as well.

Subsequent to this paper, Chen, Cheng, and Tang [9] proved that the RT-dimension 
is at most exp(d). Their proof is based on ideas from this work, in particular they follow 
and improve the argument from the proof of Lemma 1.7. 

1.3 Sample compression schemes 

A fundamental and well known statement in learning theory says that if the VC-dimension
of a concept class C is small, then any consistent$^4$ algorithm successfully PAC learns
concepts from C after seeing just a few labelled examples [45, 7]. In practice, however, a
major challenge one has to face when designing a learning algorithm is the construction 
of an hypothesis that is consistent with the examples seen. Many learning algorithms 
share the property that the output hypothesis is constructed using a small subset of the 
examples. For example, in support vector machines, only the set of support vectors is 

3 That is, C satisfies Sauer-Shelah-Perles Lemma with equality. 

4 An algorithm that outputs an hypothesis in C that is consistent with the input examples. 
needed to construct the separating hyperplane [11]. Sample compression schemes provide
a formal meaning for this algorithmic property. 

Before giving the formal definition of compression schemes, let us consider a simple
illustrative example. Assume we are interested in learning the concept class of intervals
on the real line. We get a collection of 100 samples of the form $(x, c_I(x))$ where $x \in \mathbb{R}$
and $c_I(x) \in \{0,1\}$ indicates$^5$ if x is in the interval $I \subset \mathbb{R}$. Can we remember just a few of
the samples in a way that allows us to recover all the 100 samples? In this case, the answer
is affirmative and in fact it is easy to do so. Just remember two locations, those of the
leftmost 1 and of the rightmost 1 (if there are no 1s, just remember one of the 0s). From
this data, we can reconstruct the value of $c_I$ on all the other 100 samples.
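The interval example translates into a few lines of code. This is a minimal sketch assuming samples are given as (x, label) pairs that are consistent with some interval; the names `compress` and `reconstruct` are chosen here for illustration.

```python
def compress(samples):
    """Keep the leftmost and rightmost positive examples, or one negative example."""
    ones = [x for x, label in samples if label == 1]
    if ones:
        return [(x, 1) for x in sorted({min(ones), max(ones)})]
    return samples[:1]

def reconstruct(kept):
    """Return a hypothesis (a function of x) that agrees with the original sample."""
    ones = [x for x, label in kept if label == 1]
    if not ones:
        return lambda x: 0
    lo, hi = min(ones), max(ones)
    return lambda x: 1 if lo <= x <= hi else 0

samples = [(x, 1 if 2 <= x <= 5 else 0) for x in (0.5, 1, 2.5, 3, 4.9, 6, 7)]
h = reconstruct(compress(samples))
print(all(h(x) == label for x, label in samples))
```

The reconstructed hypothesis is the smallest interval containing the kept positives, so it need not equal the unknown interval I; it only has to agree with it on the sampled points.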

The formal definition. Littlestone and Warmuth [32] formally defined sample
compression schemes as follows. Let $C \subseteq \{0,1\}^X$ with |X| = n. Let

$L_C(k_1, k_2) = \{(Y, y) : Y \subseteq X,\ k_1 \le |Y| \le k_2,\ y \in C|_Y\}$,

the set of labelled samples from C, of sizes between $k_1$ and $k_2$. A k-sample compression
scheme for C with information Q consists of two maps $\kappa, \rho$ for which the following hold:

($\kappa$) The compression map

$\kappa : L_C(1, n) \to L_C(0, k) \times Q$

takes (Y, y) to ((Z, z), q) with $Z \subseteq Y$ and $y|_Z = z$.

($\rho$) The reconstruction map

$\rho : L_C(0, k) \times Q \to \{0,1\}^X$

is so that for all (Y, y) in $L_C(1, n)$,

$\rho(\kappa(Y, y))|_Y = y$.

The size of the scheme is $k + \log|Q|$.

Intuitively, the compression map takes a long list of samples (Y, y) and encodes it as
a short sub-list of samples (Z, z) together with some small amount of side information
$q \in Q$, which helps in the reconstruction phase. The reconstruction takes a short list
of samples (Z, z) and decodes it using the side information q, without any knowledge of
(Y, y), to an hypothesis in a way that essentially inverts the compression. Specifically, the
following property must always hold: if the compression of $(Y, c|_Y)$ is the same as that of
$(Y', c'|_{Y'})$ then $c|_{Y \cap Y'} = c'|_{Y \cap Y'}$.

5 That is, $c_I(x) = 1$ iff $x \in I$.
A different perspective of the side information is as a list decoding in which the small
set of labelled examples (Z, z) is mapped to the set of hypotheses $\{\rho((Z, z), q) : q \in Q\}$,
one of which is correct.

We note that it is not necessarily the case that the reconstructed hypothesis belongs
to the original class C. All it has to satisfy is that for any $(Y, y) \in L_C(1, n)$ such that
$h = \rho(\kappa(Y, y))$ we have that $h|_Y = y$. Thus, h has to be consistent only on the sampled
coordinates that were compressed and not elsewhere.

Let us consider a simple example of a sample compression scheme, to help digest the
definition. Let C be a concept class and let r be the rank over, say, $\mathbb{R}$ of the matrix whose
rows correspond to the concepts in C. We claim that there is an r-sample compression
scheme for C with no side information. Indeed, for any $Y \subseteq X$, let $Z_Y$ be a set of at most
r columns that span the columns of the matrix $C|_Y$. Given a sample (Y, y) compress it
to $\kappa(Y, y) = (Z_Y, z)$ for $z = y|_{Z_Y}$. The reconstruction map $\rho$ takes (Z, z) to any concept
$h \in C$ so that $h|_Z = z$. This sample compression scheme works since if $(Z, z) = \kappa(Y, y)$
then every two different rows in $C|_Y$ must disagree on Z.

Connections to learning. Sample compression schemes are known to yield practical 
learning algorithms (see e.g. [3-1]), and allow learning for multi labelled concept classes [41]. 

They can also be interpreted as a formal manifestation of Occam’s razor. Occam’s 
razor is a philosophical principle attributed to William of Ockham from the late middle 
ages. It says that in the quest for an explanation or an hypothesis, one should prefer 
the simplest one which is consistent with the data. There are many works on the role 
of Occam’s razor in learning theory, a partial list includes [32, 7, 16, 37, 26, 17, 13]. 
In the context of sample compression schemes, simplicity is captured by the size of the 
compression scheme. Interestingly, this manifestation of Occam’s razor is provably useful 
[32]: Sample compression schemes imply PAC learnability. 

Theorem 1.5 (Littlestone-Warmuth). Let $C \subseteq \{0,1\}^X$, and $c \in C$. Let $\mu$ be a
distribution on X, and $x_1, \ldots, x_m$ be m independent samples from $\mu$. Let $Y = (x_1, \ldots, x_m)$ and
$y = c|_Y$. Let $\kappa, \rho$ be a k-sample compression scheme for C with additional information Q.
Let $h = \rho(\kappa(Y, y))$. Then,

$\Pr_{\mu^m}(\mathrm{dist}_\mu(h, c) > \epsilon) < |Q| \sum_{j=0}^{k} \binom{m}{j} (1 - \epsilon)^{m-j}$.

Proof sketch. There are $\sum_{j=0}^{k} \binom{m}{j}$ subsets T of [m] of size at most k. There are |Q|
choices for $q \in Q$. Each choice of T, q yields a function $h_{T,q} = \rho((T, y_T), q)$ that is
measurable with respect to $x_T = (x_t : t \in T)$. The function h is one of the functions in
$\{h_{T,q} : |T| \le k, q \in Q\}$. For each $h_{T,q}$, the coordinates in [m] − T are independent, and
so if $\mathrm{dist}_\mu(h_{T,q}, c) > \epsilon$ then the probability that all these m − |T| samples agree with c is
less than $(1 - \epsilon)^{m - |T|}$. The union bound completes the proof. □
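To get a feeling for the numbers, the union bound from the proof sketch can be evaluated directly. The sketch below assumes the bound has the form $|Q| \sum_{j \le k} \binom{m}{j} (1-\epsilon)^{m-j}$, as the counting in the proof sketch suggests; the function name `compression_failure_bound` and its parameter names are hypothetical.

```python
from math import comb

def compression_failure_bound(m, k, eps, q_size=1):
    """Evaluate q_size * sum_{j=0..k} C(m, j) * (1 - eps)**(m - j)."""
    return q_size * sum(comb(m, j) * (1 - eps) ** (m - j) for j in range(k + 1))

# For fixed k and eps the bound decays roughly like m**k * (1 - eps)**m.
for m in (100, 200, 400):
    print(m, compression_failure_bound(m, k=2, eps=0.1))
```

The polynomial factor from counting the subsets T is eventually dominated by the exponential decay, which is the quantitative content of the statement that compression implies PAC learnability.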

The sample complexity of PAC learning is essentially the VC-dimension. Thus, from
Theorem 1.5 we expect the VC-dimension to bound from below the size of sample
compression schemes. Indeed, [ ] proved that there are concept classes of VC-dimension d
for which any sample compression scheme has size at least d.

This is part of the motivation for the following basic question that was asked by 
Littlestone and Warmuth [32] nearly 30 years ago: Does a concept class of VC-dimension 
d have a sample compression scheme of size depending only on d (and not on the universe 
size)? 

In fact, unlike the VC-dimension, the definition of sample compression schemes as
well as the fact that they imply PAC learnability naturally generalizes to multi-class
classification problems [ ]. Thus, Littlestone and Warmuth’s question above can be seen
as the boolean instance of a much broader question: Is it true that the size of an optimal
sample compression scheme for a given concept class (not necessarily binary-labeled) is
the sample complexity of PAC learning of this class?

Previous constructions. Floyd [ ] and Floyd and Warmuth [ ] constructed sample
compression schemes of size log |C|. The construction in [ ] uses a transformation that
converts certain online learning algorithms to compression schemes. Helmbold and
Warmuth [26] and Freund [18] showed how to compress a sample of size m to a sample of size
$O(\log(m))$ using some side information for classes of constant VC-dimension (the implicit
constant in the $O(\cdot)$ depends on the VC-dimension).

In a long line of works, several interesting compression schemes for special cases were
constructed. A partial list includes Helmbold et al. [ 5], Floyd and Warmuth [ 7 ],
Ben-David and Litman [6], Chernikov and Simon [10], Kuzmin and Warmuth [31], Rubinstein
et al. [38], Rubinstein and Rubinstein [39], Livni and Simon [33] and more. These works
provided connections between compression schemes and geometry, topology and model
theory.

Our contribution. Here we make the first quantitative progress on this question since
the work of Floyd [16]. The following theorem shows that low VC-dimension implies the
existence of relatively efficient compression schemes. The constructive proof is provided
in Section 4.

Theorem 1.6 (Sample compression scheme). If C has VC-dimension d then it has a
k-sample compression scheme with additional information Q where $k = O(d 2^d \log\log|C|)$
and $\log|Q| \le O(k \log(k))$.
Subsequent to this paper, the first and the last authors improved this bound [35],
showing that any concept class of VC-dimension d has a sample compression scheme of
size at most exp(d). The techniques used in [35] differ from the techniques we use in this
paper. In particular, our scheme relies on Haussler’s Packing Lemma (Theorem 1.3) and
recursion, while the scheme in [35] relies on von Neumann’s minimax theorem [36] and the
$\epsilon$-approximation theorem [45, 24], which follow from the double-sampling argument of [ ].
Thus, despite the fact that our scheme is weaker than the one in [35], it provides a different
angle on sample compression, which may be useful in further improving the exponential
dependence on the VC-dimension to an optimal linear dependence, as conjectured by
Floyd and Warmuth [17, 46].

1.4 Discussion and open problems 

This work provides relatively efficient constructions of teaching sets and sample
compression schemes. However, the exact relationship between VC-dimension, sample
compression scheme size, and the RT-dimension remains unknown. Is there always a concept with
a teaching set of size depending only on the VC-dimension? (The interesting case is finite
concept classes, as mentioned above.) Are there always sample compression schemes of
size linear (or even polynomial) in the VC-dimension?

The simplest case that is still open is VC-dimension 2. One can refine this case even
further. VC-dimension 2 means that on any three coordinates $x, y, z \in X$, the projection
$C|_{\{x,y,z\}}$ has at most 7 patterns. A more restricted family of classes is (3, 6) concept
classes, for which on any three coordinates there are at most 6 patterns. We can show
that the recursive teaching dimension of (3, 6) classes is at most 3.

Lemma 1.7. Let C be a finite (3, 6) concept class. Then there exists some $c \in C$ with a
teaching set of size at most 3.

Proof. Assume that $C \subseteq \{0,1\}^X$ with X = [n]. If C has VC-dimension 1 then there exists
$c \in C$ with a teaching set of size 1 (see [30, 1]). Therefore, assume that the VC-dimension
of C is 2. Every shattered pair $\{x, x'\} \subseteq X$ partitions C into 4 nonempty sets:

$C^{x,x'}_{b,b'} = \{c \in C : c(x) = b,\ c(x') = b'\}$,

for $b, b' \in \{0,1\}$. Pick a shattered pair $\{x_*, x'_*\}$ and $b_*, b'_*$ for which the size of $C^{x_*,x'_*}_{b_*,b'_*}$ is
minimal. Without loss of generality assume that $\{x_*, x'_*\} = \{1, 2\}$ and that $b_* = b'_* = 0$.
To simplify notation, we denote $C^{1,2}_{b,b'}$ simply by $C_{b,b'}$.

We prove below that $C_{0,0}$ has VC-dimension 1. This completes the proof since then
there is some $c \in C_{0,0}$ and some $x \in [n] \setminus \{1, 2\}$ such that {x} is a teaching set for c in
$C_{0,0}$. Therefore, {1, 2, x} is a teaching set for c in C.
First, a crucial observation is that since C is a (3, 6) class, no pair $\{x, x'\} \subseteq [n] \setminus \{1, 2\}$ is
shattered by both $C_{0,0}$ and $C \setminus C_{0,0}$. Indeed, if $C \setminus C_{0,0}$ shatters {x, x'} then either $C_{1,0} \cup C_{1,1}$
or $C_{0,1} \cup C_{1,1}$ has at least 3 patterns on {x, x'}. If in addition $C_{0,0}$ shatters {x, x'} then
C has at least 7 patterns on {1, x, x'} or {2, x, x'}, contradicting the assumption that C
is a (3, 6) class.

Now, assume towards contradiction that $C_{0,0}$ shatters {x, x'}. Thus, {x, x'} is not
shattered by $C \setminus C_{0,0}$, which means that there is some pattern $p \in \{0,1\}^{\{x,x'\}}$ so that
$p \notin (C \setminus C_{0,0})|_{\{x,x'\}}$. This implies that $C^{x,x'}_{p(x),p(x')}$ is a proper subset of $C_{0,0}$, contradicting
the minimality of $C_{0,0}$. □

2 The dual class 

We shall repeatedly use the dual concept class to C and its properties. The dual concept
class $C^* \subseteq \{0,1\}^C$ of C is defined by $C^* = \{c_x : x \in X\}$, where $c_x : C \to \{0,1\}$ is the
map so that $c_x(c) = 1$ iff c(x) = 1. If we think of C as a binary matrix whose rows are
the concepts in C, then $C^*$ corresponds to the distinct rows of the transposed matrix (so
it may be that $|C^*| < n(C)$).
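In matrix terms the dual class is simply the set of distinct columns. A minimal sketch, assuming concepts are stored as 0/1 tuples in a fixed order (the function name `dual_class` is hypothetical):

```python
def dual_class(concepts):
    """Distinct columns of the concept matrix, each viewed as a 0/1 tuple indexed by C."""
    concepts = list(concepts)
    n = len(concepts[0])
    columns = {tuple(c[x] for c in concepts) for x in range(n)}
    return sorted(columns)

C = [(0, 0, 1), (1, 1, 0)]
print(dual_class(C))  # the first two points give the same column, so only 2 dual concepts
```

Here the domain has three points but the dual class has only two concepts, illustrating that the dual class can be smaller than n(C).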

We use the following well known property (see [ ]). 

Claim 2.1 (Assouad). If the VC-dimension of C is d then the VC-dimension of $C^*$ is at
most $2^{d+1}$.

Proof sketch. If the VC-dimension of $C^*$ is $2^{d+1}$ then in the matrix representing C there
are $2^{d+1}$ rows that are shattered, and in these rows there are d + 1 columns that are
shattered. □

We also define the dual approximating set (recall the definition of $A_\mu(C, \epsilon)$ from
Section 1.1). Denote by $A^*(C, \epsilon)$ the set $A_U(C^*, \epsilon)$, where U is the uniform distribution on
C, the domain of $C^*$.

3 Teaching sets 

In this section we prove Theorem 1.4. The high level idea is to use Theorem 1.3 and
Claim 2.1 to identify two distinct x, x' in X so that the set of $c \in C$ with $c(x) \neq c(x')$
is much smaller than |C|, add x, x' to the teaching set, and continue inductively.

Proof of Theorem 1.4. For classes with VC-dimension 1 there is $c \in C$ with a teaching
set of size 1, see e.g. [12]. We may therefore assume that $d \ge 2$.
We show that if $|C| > (4e^2)^{d \cdot 2^{d+2}}$, then there exist $x \neq x'$ in X such that

$0 < |\{c \in C : c(x) = 0 \text{ and } c(x') = 1\}| \le |C|^{1 - \frac{1}{d 2^{d+2}}}$.   (1)

From this the theorem follows, since if we iteratively add such x, x' to the teaching set and
restrict ourselves to $\{c \in C : c(x) = 0 \text{ and } c(x') = 1\}$, then after at most $d 2^{d+2} \log\log|C|$
iterations, the size of the remaining class is reduced to less than $(4e^2)^{d \cdot 2^{d+2}}$. At this point
we can identify a unique concept by adding at most $\log((4e^2)^{d \cdot 2^{d+2}})$ additional indices to
the teaching set, using the halving argument of [12, 47]. This gives a teaching set of size
at most $2 d 2^{d+2} \log\log|C| + d 2^{d+2} \log(4e^2)$ for some $c \in C$, as required.
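The iteration count can be sanity-checked numerically. The sketch below tracks only $\log_2$ of the class size and assumes each iteration shrinks the class from size s to at most $s^{1 - 1/D}$ with $D = d 2^{d+2}$, stopping once the size drops to $(4e^2)^D$; the use of base-2 logarithms and the name `iterations_until_small` are assumptions made for this illustration.

```python
from math import e, log2

def iterations_until_small(log_size, d):
    """Count shrinking steps until log2 |C| is at most D * log2(4 e^2)."""
    D = d * 2 ** (d + 2)
    threshold = D * log2(4 * e ** 2)
    steps = 0
    while log_size > threshold:
        log_size *= 1 - 1 / D  # |C| -> |C|**(1 - 1/D)
        steps += 1
    return steps

d = 2
log_size = 2.0 ** 20  # a class with log2 |C| = 2**20
D = d * 2 ** (d + 2)
print(iterations_until_small(log_size, d), D * log2(log_size))
```

In this example the simulated number of steps stays below $D \log_2\log_2|C|$, in line with the double-logarithmic bound on the number of iterations.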

In order to prove (1), it is enough to show that there exist $c_x \neq c_y$ in $C^*$ such that the
normalized Hamming distance between $c_x, c_y$ is at most $\epsilon := |C|^{-\frac{1}{d 2^{d+2}}}$. Assume towards
contradiction that the distance between every two concepts in $C^*$ is more than $\epsilon$, and
assume without loss of generality that $n(C) = |C^*|$ (that is, all the columns in C are
distinct). By Claim 2.1, the VC-dimension of $C^*$ is at most $2^{d+1}$. Theorem 1.3 thus
implies that

$n(C) = |C^*| \le \left(\frac{4e^2}{\epsilon}\right)^{2^{d+1}} < \left(\frac{1}{\epsilon}\right)^{2^{d+2}}$,   (2)

where the last inequality follows from the definition of $\epsilon$ and the assumption on the size
of C. Therefore, we arrive at the following contradiction:

$|C| \le (n(C))^d$   (by Theorem 1.1, since $VC(C) \ge 2$)
$\quad < \left(\frac{1}{\epsilon}\right)^{d \cdot 2^{d+2}}$   (by Equation 2 above)
$\quad = |C|$.   (by definition of $\epsilon$)

□


4 Sample compression schemes 

In this section we prove Theorem 1.6. The theorem statement and the definition of sample 
compression schemes appear in Section 1.3. 

While the details are somewhat involved, due to the complexity of the definitions, the 
high level idea may be (somewhat simplistically) summarized as follows. 

For an appropriate choice of $\epsilon$, we pick an $\epsilon$-approximating set $A^*$ of the dual class $C^*$.
It is helpful to think of $A^*$ as a subset of the domain X. Now, either $A^*$ faithfully represents
the sample (Y, y) or it does not (we do not formally define “faithfully represents” here).
We identify the following win-win situation: In both cases, we can reduce the compression
task to that in a much smaller set of concepts of size at most $\epsilon|C| \approx |C|^{1 - 2^{-d}}$, similarly
to teaching sets in Section 3. This yields the same double-logarithmic behavior.

In the case that $A^*$ faithfully represents (Y, y), Case 2 below, we recursively compress
in the small class $C|_{A^*}$. In the unfaithful case, Case 1 below, we recursively compress
in a (small) set of concepts for which disagreement occurs on some point of Y, just as
in Section 3. In both cases, we have to extend the recursive solution, and the cost is
adding one sample point to the compressed sample (and some small amount of additional
information by which we encode whether Case 1 or 2 occurred).

The compression we describe is inductively defined, and has the following additional
structure. Let ((Z, z), q) be in the image of $\kappa$. The information q is of the form q = (f, T),
where $T \ge 0$ is an integer so that $|Z| \le T + O(d \cdot 2^d)$, and $f : \{0, 1, \ldots, T\} \to Z$ is a
partial one-to-one function$^6$.

The rest of this section is organized as follows. In Section 4.1 we define the compression
map $\kappa$. In Section 4.2 we give the reconstruction map $\rho$. The proof of correctness is
given in Section 4.3 and the upper bound on the size of the compression is calculated in
Section 4.4.

4.1 Compression map: defining $\kappa$

Let $C$ be a concept class. The compression map is defined by induction on $n = n(C)$.
For simplicity of notation, let $d = VC(C) + 2$.

In what follows we shall routinely use $A^*(C, \epsilon)$. There are several $\epsilon$-approximating sets
and so we would like to fix one of them, say, the one obtained by greedily adding columns
to $A^*(C, \epsilon)$ starting from the first⁷ column (recall that we can think of $C$ as a matrix
whose rows correspond to concepts in $C$ and whose columns are concepts in the dual class
$C^*$). To keep notation simple, we shall use $A^*(C, \epsilon)$ to denote both the approximating
set in $C^*$ and the subset of $X$ composed of columns that give rise to $A^*(C, \epsilon)$. This is a
slight abuse of notation but the relevant meaning will always be clear from the context.
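The greedy choice described above can be sketched as follows. This is a minimal sketch under assumptions the text does not fix: concepts are tuples of 0/1 labels indexed by the points 0, ..., n-1 (in the fixed order on X), a column is kept when its normalized Hamming distance to every column kept so far exceeds eps, and r(x) is taken to be the first kept column within distance eps of x. The function names are hypothetical.

```python
def column(concepts, x):
    """Column of the class matrix at point x (a concept of the dual class)."""
    return tuple(c[x] for c in concepts)

def hamming(u, v):
    """Normalized Hamming distance between two equal-length 0/1 tuples."""
    return sum(a != b for a, b in zip(u, v)) / len(u)

def greedy_approximating_set(concepts, points, eps):
    """Scan points in order; keep x if it is > eps away from all kept points."""
    chosen = []
    for x in points:
        col = column(concepts, x)
        if all(hamming(col, column(concepts, a)) > eps for a in chosen):
            chosen.append(x)
    return chosen

def rounding(concepts, points, chosen, eps):
    """Map each point to the first kept point within distance eps (assumed rule)."""
    r = {}
    for x in points:
        col = column(concepts, x)
        r[x] = next(a for a in chosen
                    if hamming(col, column(concepts, a)) <= eps)
    return r

concepts = [(0, 0, 1, 1), (0, 1, 1, 1), (1, 1, 0, 0), (1, 1, 0, 1)]
points = [0, 1, 2, 3]
chosen = greedy_approximating_set(concepts, points, 0.25)
r = rounding(concepts, points, chosen, 0.25)
```

By construction every skipped point lies within eps of some kept point, so the fraction of concepts c with c(x) != c(r(x)) is at most eps for every x.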

Induction base. The base of the induction applies to all concept classes $C$ so that
$|C| \le (4e^2)^{d \cdot 2^d + 1}$. In this case, we use the compression scheme of Floyd and Warmuth [16,
17], which has size $\log(|C|) = O(d \cdot 2^d)$. This compression scheme has no additional
information. Therefore, to maintain the structure of our compression scheme we append
to it redundant additional information by setting $T = 0$ and $f$ to be empty.

6 That is, it is defined over a subset of {0, 1,..., T} and it is injective on its domain. 

7 We shall assume w.l.o.g. that there is some well known order on X. 
Induction step. Let $C$ be so that $|C| > (4e^2)^{d \cdot 2^d + 1}$. Let $0 < \epsilon < 1$ be so that

$$\epsilon |C| = \left(\frac{1}{\epsilon}\right)^{d \cdot 2^d}. \qquad (3)$$

This choice balances the recursive size. By Claim 2.1, the VC-dimension of $C^*$ is at most
$2^{d-1}$ (recall that $d = VC(C) + 2$). Theorem 1.3 thus implies that

$$|A^*(C, \epsilon)| \le \left(\frac{4e^2}{\epsilon}\right)^{2^{d-1}} \le \left(\frac{1}{\epsilon}\right)^{2^d} < n(C) \qquad (4)$$

(where the second inequality follows from the definition of $\epsilon$ and the assumption on the
size of $C$, and the last inequality follows from the definition of $\epsilon$ and Theorem 1.1).


Let $(Y, y) \in L_C(1, n)$. Every $x \in X$ has a rounding⁸ $r(x)$ in $A^*(C, \epsilon)$. We distinguish
between two cases:

Case 1: There exist $x \in Y$ and $c \in C$ such that $c|_Y = y$ and $c(r(x)) \ne c(x)$.

This is the unfaithful case in which we recurse as in Section 3. Let

$$\begin{aligned}
C' &= \{c'|_{X - \{x, r(x)\}} : c' \in C,\ c'(x) = c(x),\ c'(r(x)) = c(r(x))\}, \\
Y' &= Y - \{x, r(x)\}, \\
y' &= y|_{Y'}.
\end{aligned}$$

Apply recursively $\kappa$ on $C'$ and the sample $(Y', y') \in L_{C'}(1, n(C'))$. Let $((Z', z'), (f', T'))$
be the result of this compression. Output $((Z, z), (f, T))$ defined as⁹

$$\begin{aligned}
Z &= Z' \cup \{x\}, \\
z|_{Z'} &= z', \quad z(x) = y(x), \\
T &= T' + 1, \\
f|_{\{0, \ldots, T-1\}} &= f'|_{\{0, \ldots, T-1\}}, \\
f(T) &= x \qquad \text{($f$ is defined on $T$, marking that Case 1 occurred).}
\end{aligned}$$


Case 2: For all $x \in Y$ and $c \in C$ such that $c|_Y = y$, we have $c(x) = c(r(x))$.

This is the faithful case, in which we compress by restricting $C$ to $A^*$. Consider
$r(Y) = \{r(y') : y' \in Y\} \subseteq A^*(C, \epsilon)$. For each $x' \in r(Y)$, pick¹⁰ $s(x') \in Y$ to be
an element such that $r(s(x')) = x'$. Let

$$\begin{aligned}
C' &= C|_{A^*(C, \epsilon)}, \\
Y' &= r(Y), \\
y'(x') &= y(s(x')) \quad \forall x' \in Y'.
\end{aligned}$$

By (4), we know $|A^*(C, \epsilon)| < n(C)$. Therefore, we can recursively apply $\kappa$ on
$C'$ and $(Y', y') \in L_{C'}(1, n(C'))$ and get $((Z', z'), (f', T'))$. Output $((Z, z), (f, T))$
defined as

$$\begin{aligned}
Z &= \{s(x') : x' \in Z'\}, \\
z(x) &= z'(r(x)) \quad \forall x \in Z \qquad (r(x) \in Z'), \\
T &= T' + 1, \\
f &= f' \qquad \text{($f$ is not defined on $T$, marking that Case 2 occurred).}
\end{aligned}$$

8 The choice of $r(x)$ also depends on $C$, $\epsilon$, but to simplify the notation we do not explicitly mention it.

9 Remember that $f$ is a partial function.

10 The function $s$ can be thought of as the inverse of $r$. Since $r$ is not necessarily invertible we use a
different notation than $r^{-1}$.

The following lemma summarizes two key properties of the compression scheme. The
correctness of this lemma follows directly from the definitions of Cases 1 and 2 above.

Lemma 4.1. Let $(Y, y) \in L_C(1, n(C))$ and let $((Z, z), (f, T))$ be the compression of $(Y, y)$
described above, where $T \ge 1$. The following properties hold:

1. $f$ is defined on $T$ and $f(T) = x$ iff $x \in Y$ and there exists $c \in C$ such that $c|_Y = y$
and $c(r(x)) \ne c(x)$.

2. $f$ is not defined on $T$ iff for all $x \in Y$ and $c \in C$ such that $c|_Y = y$, it holds that
$c(x) = c(r(x))$.
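The case distinction recorded in Lemma 4.1 can be sketched as a small test. This is an illustrative sketch only: concepts and the labeled sample are represented as dictionaries from points to labels, the rounding `r` is supplied as a dictionary, and the helper name `find_unfaithful_point` is hypothetical. When several points witness Case 1, the sketch returns the smallest one, which is a choice the text leaves open.

```python
def find_unfaithful_point(concepts, sample, r):
    """Return a point x in Y witnessing Case 1, or None if Case 2 holds.

    concepts: list of dicts point -> label (the class C)
    sample:   dict point -> label (the labeled sample (Y, y))
    r:        dict point -> rounded point in the approximating set
    """
    # Concepts of C that agree with the sample on all of Y.
    consistent = [c for c in concepts
                  if all(c[p] == label for p, label in sample.items())]
    for x in sorted(sample):
        # Case 1: some consistent concept separates x from its rounding.
        if any(c[r[x]] != c[x] for c in consistent):
            return x
    # Case 2: every consistent concept agrees on x and r(x) for all x in Y.
    return None

C = [{0: 0, 1: 0, 2: 1}, {0: 1, 1: 1, 2: 0}, {0: 0, 1: 1, 2: 1}]
r = {0: 0, 1: 0, 2: 2}
case1_witness = find_unfaithful_point(C, {1: 1}, r)
case2_result = find_unfaithful_point(C, {1: 0}, r)
```

In the first call the third concept is consistent with the sample and labels point 1 differently from its rounding 0, so Case 1 applies; in the second call only the first concept is consistent and it agrees on both points, so Case 2 applies.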

4.2 Reconstruction map: defining $\rho$

The reconstruction map is similarly defined by induction on $n(C)$. Let $C$ be a concept
class and let $((Z, z), (f, T))$ be in the image¹¹ of $\kappa$ with respect to $C$. Let $\epsilon = \epsilon(C)$ be as
in (3).

Induction base. The induction base here applies to the same classes as the induction
base of the compression map. This is the only case where $T = 0$, and we apply the
reconstruction map of Floyd and Warmuth [16, 17].

11 For $((Z, z), (f, T))$ not in the image of $\kappa$ we set $\rho((Z, z), (f, T))$ to be some arbitrary concept.
Induction step. Distinguish between two cases:

Case 1: $f$ is defined on $T$.

Let $x = f(T)$. Denote

$$\begin{aligned}
X' &= X - \{x, r(x)\}, \\
C' &= \{c'|_{X'} : c' \in C,\ c'(x) = z(x),\ c'(r(x)) = 1 - z(x)\}, \\
Z' &= Z - \{x, r(x)\}, \\
z' &= z|_{Z'}, \\
T' &= T - 1, \\
f' &= f|_{\{0, \ldots, T'\}}.
\end{aligned}$$

Apply recursively $\rho$ on $C'$, $((Z', z'), (f', T'))$. Let $h' \in \{0, 1\}^{X'}$ be the result.
Output $h$ where

$$\begin{aligned}
h|_{X'} &= h', \\
h(x) &= z(x), \\
h(r(x)) &= 1 - z(x).
\end{aligned}$$


Case 2: $f$ is not defined on $T$.

Consider $r(Z) = \{r(x) : x \in Z\} \subseteq A^*(C, \epsilon)$. For each $x' \in r(Z)$, pick $s(x') \in Z$
to be an element such that $r(s(x')) = x'$. Let

$$\begin{aligned}
X' &= A^*(C, \epsilon), \\
C' &= C|_{X'}, \\
Z' &= r(Z), \\
z'(x') &= z(s(x')) \quad \forall x' \in Z', \\
T' &= T - 1, \\
f' &= f|_{\{0, \ldots, T'\}}.
\end{aligned}$$

Apply recursively $\rho$ on $C'$, $((Z', z'), (f', T'))$ and let $h' \in \{0, 1\}^{X'}$ be the result.
Output $h$ satisfying

$$h(x) = h'(r(x)) \quad \forall x \in X.$$
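The two ways of extending the recursive output $h'$ to a hypothesis $h$ on all of $X$ can be sketched as follows. This is only an illustration of the last step of each case, assuming hypotheses are dictionaries from points to labels; the names `lift_case1` and `lift_case2` are hypothetical and the recursive call itself is not modeled.

```python
def lift_case1(h_prime, x, rx, zx):
    """Case 1: keep h' on X - {x, r(x)}, set h(x) = z(x) and h(r(x)) = 1 - z(x)."""
    h = dict(h_prime)
    h[x] = zx
    h[rx] = 1 - zx
    return h

def lift_case2(h_prime, r, points):
    """Case 2: h' lives on the approximating set; set h(x) = h'(r(x)) on X."""
    return {x: h_prime[r[x]] for x in points}

h1 = lift_case1({2: 0, 3: 1}, x=0, rx=1, zx=1)
h2 = lift_case2({0: 1, 2: 0}, r={0: 0, 1: 0, 2: 2, 3: 2}, points=[0, 1, 2, 3])
```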


4.3 Correctness 

The following lemma yields the correctness of the compression scheme. 
Lemma 4.2. Let $C$ be a concept class, $(Y, y) \in L_C(1, n)$, $\kappa(Y, y) = ((Z, z), (f, T))$ and
$h = \rho(\kappa(Y, y))$. Then,

1. $Z \subseteq Y$ and $z|_Z = y|_Z$, and

2. $h|_Y = y|_Y$.

Proof. We proceed by induction on $n(C)$. In the base case, $|C| \le (4e^2)^{d \cdot 2^d + 1}$ and the
lemma follows from the correctness of Floyd and Warmuth's compression scheme (this
is the only case in which $T = 0$). In the induction step, assume $|C| > (4e^2)^{d \cdot 2^d + 1}$. We
distinguish between two cases:

Case 1: $f$ is defined on $T$.

Let $x = f(T)$. This case corresponds to Case 1 in the definition of $\kappa$ and Case 1
in the definition of $\rho$. By Item 1 of Lemma 4.1, $x \in Y$ and there exists $c \in C$
such that $c|_Y = y$ and $c(r(x)) \ne c(x)$. Let $C'$, $(Y', y')$ be the class and sample
defined in Case 1 in the definition of $\kappa$. Since $n(C') < n(C)$, we know that $\kappa$, $\rho$
on $C'$ satisfy the induction hypothesis. Let

$$\begin{aligned}
((Z', z'), (f', T')) &= \kappa(C', (Y', y')), \\
h' &= \rho(C', ((Z', z'), (f', T'))),
\end{aligned}$$

be the resulting compression and reconstruction. Since we are in Case 1 in the
definition of $\kappa$ and Case 1 in the definition of $\rho$, $((Z, z), (f, T))$ and $h$ have the
following form:

$$\begin{aligned}
Z &= Z' \cup \{x\}, \\
z|_{Z'} &= z', \quad z(x) = y(x), \\
T &= T' + 1, \\
f|_{\{0, \ldots, T-1\}} &= f'|_{\{0, \ldots, T-1\}}, \\
f(T) &= x,
\end{aligned}$$

and

$$\begin{aligned}
h|_{X - \{x, r(x)\}} &= h', \\
h(x) &= z(x) = y(x) = c(x), \\
h(r(x)) &= 1 - z(x) = 1 - y(x) = 1 - c(x) = c(r(x)).
\end{aligned}$$
Consider item 1 in the conclusion of the lemma. By the definition of $Y'$ and $x$,

$$\begin{aligned}
Y' \cup \{x\} &\subseteq Y, && \text{(by the definition of $Y'$)} \\
Z' &\subseteq Y'. && \text{(by the induction hypothesis)}
\end{aligned}$$

Therefore, $Z = Z' \cup \{x\} \subseteq Y$.

Consider item 2 in the conclusion of the lemma. By construction and induction,
$h|_{Y \cap \{x, r(x)\}} = c|_{Y \cap \{x, r(x)\}} = y|_{Y \cap \{x, r(x)\}}$ and $h|_{Y'} = h'|_{Y'} = y'$.

Thus, $h|_Y = y$.

Case 2: $f$ is not defined on $T$.

This corresponds to Case 2 in the definition of $\kappa$ and Case 2 in the definition of $\rho$.
Let $C'$, $(Y', y')$ be the result of Case 2 in the definition of $\kappa$. Since $n(C') < n(C)$,
we know that $\kappa$, $\rho$ on $C'$ satisfy the induction hypothesis. Let

$$\begin{aligned}
h' &= \rho(C', ((Z', z'), (f', T'))), \\
s &: Y' \to Y,
\end{aligned}$$

as defined in Case 2 in the definition of $\kappa$ and Case 2 in the definition of $\rho$. By
construction, $((Z, z), (f, T))$ and $h$ have the following form:

$$\begin{aligned}
Z &= \{s(x') : x' \in Z'\}, \\
z(x) &= z'(r(x)) \quad \forall x \in Z, \\
T &= T' + 1, \\
f &= f',
\end{aligned}$$

and

$$h(x) = h'(r(x)) \quad \forall x \in X.$$

Consider item 1 in the conclusion of the lemma. Let $x \in Z$. By the induction
hypothesis, $Z' \subseteq Y'$. Thus, $x = s(x')$ for some $x' \in Z' \subseteq Y'$. Since the range of $s$ is $Y$, it
follows that $x \in Y$. This shows that $Z \subseteq Y$.
Consider item 2 in the conclusion of the lemma. For $x \in Y$,

$$\begin{aligned}
h(x) &= h'(r(x)) && \text{(by the definition of $h$)} \\
&= y'(r(x)) && \text{(by the induction hypothesis)} \\
&= y(s(r(x))) && \text{(by the definition of $y'$ in Case 2 of $\kappa$)} \\
&= y(x),
\end{aligned}$$

where the last equality holds due to item 2 of Lemma 4.1: Indeed, let $c \in C$ be so that
$c|_Y = y$. Since $f$ is not defined on $T$, for all $x \in Y$ we have $c(x) = c(r(x))$. In addition,
for all $x \in Y$ it holds that $r(s(r(x))) = r(x)$ and $s(r(x)) \in Y$. Hence, if $y(s(r(x))) \ne y(x)$
then one of them is different from $c(r(x))$, contradicting the assumption that we are in
Case 2 of $\kappa$. □

4.4 The compression size

Consider a concept class $C$ which is not part of the induction base (i.e. $|C| > (4e^2)^{d \cdot 2^d + 1}$).
Let $\epsilon = \epsilon(C)$ be as in (3). We show the effect of each case in the definition of $\kappa$ on either
$|C|$ or $n(C)$:

1. Case 1 in the definition of $\kappa$: here the size of $C'$ becomes smaller,

$$|C'| \le \epsilon |C|.$$

Indeed, this holds as in the dual set system $C^*$, the normalized Hamming distance
between $c_x$ and $c_{r(x)}$ is at most $\epsilon$ and therefore the number of $c \in C$ such that
$c(x) \ne c(r(x))$ is at most $\epsilon |C|$.

2. Case 2 in the definition of $\kappa$: here $n(C')$ becomes smaller, as

$$n(C') = |A^*(C, \epsilon)| \le \left(\frac{1}{\epsilon}\right)^{2^d}.$$

We now show that in either case, $|C'| \le |C|^{1 - \frac{1}{d \cdot 2^d + 1}}$, which implies that after

$$O((d \cdot 2^d + 1) \log\log |C|)$$

iterations, we reach the induction base.
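The double-logarithmic iteration count can be illustrated numerically. The sketch below tracks only $\ln|C|$ and assumes the worst case in which every step shrinks the class exactly to $|C|^{1 - 1/(d 2^d + 1)}$; it reads the base-case threshold as $(4e^2)^{d 2^d + 1}$ with $e$ Euler's constant, and takes `d` to be the paper's $d = VC(C) + 2$. The function name `iterations_to_base` is hypothetical.

```python
import math

def iterations_to_base(log_size, d):
    """Count worst-case recursion steps, given log_size = ln|C|."""
    D = d * 2 ** d
    # ln of the base-case threshold (4e^2)^(D+1).
    base = (D + 1) * math.log(4 * math.e ** 2)
    # Raising |C| to the power 1 - 1/(D+1) multiplies ln|C| by that factor.
    shrink = 1 - 1 / (D + 1)
    steps = 0
    while log_size > base:
        log_size *= shrink
        steps += 1
    return steps

d = 3
log_size = 1e6                      # ln|C|, i.e. |C| = e^(10^6)
steps = iterations_to_base(log_size, d)
D = d * 2 ** d
base = (D + 1) * math.log(4 * math.e ** 2)
# Since -ln(1 - 1/(D+1)) >= 1/(D+1), at most (D+1) * ln(ln|C| / base) steps occur.
upper = math.ceil((D + 1) * math.log(log_size / base))
```

The quantity `upper` is $(d 2^d + 1)$ times the logarithm of $\ln|C|$ (up to the base threshold), matching the $O((d \cdot 2^d + 1)\log\log|C|)$ bound.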

In Case 1:

$$|C'| \le \epsilon |C| = |C|^{1 - \frac{1}{d \cdot 2^d + 1}}. \qquad \text{(by the definition of $\epsilon$)}$$

In Case 2:

$$\begin{aligned}
|C'| &\le (n(C'))^d && \text{(by Theorem 1.1, since $VC(C') \le d - 2$)} \\
&\le \left(\frac{1}{\epsilon}\right)^{d \cdot 2^d} && \text{(by Theorem 1.3, since $n(C') = |A^*(C, \epsilon)|$)} \\
&= |C|^{1 - \frac{1}{d \cdot 2^d + 1}}. && \text{(by definition of $\epsilon$)}
\end{aligned}$$

Remark. Note the similarity between the analysis of the cases above, and the analysis
of the size of a teaching set in Section 3. Case 1 corresponds to the rate of the progress
performed in each iteration of the construction of a teaching set. Case 2 corresponds to
the calculation showing that in each iteration significant progress can be made.

Thus, the compression map $\kappa$ performs at most

$$O((d \cdot 2^d + 1) \log\log |C|)$$

iterations. In every step of the recursion the sizes of $Z$ and $T$ increase by at most 1. In
the base of the recursion, $T$ is 0 and the size of $Z$ is at most $O(d \cdot 2^d)$. Hence, the total
size of the compression satisfies

$$\begin{aligned}
|Z| &\le k = O(2^d d \log\log |C|), \\
\log(|Q|) &\le O(k \log(k)).
\end{aligned}$$

This completes the proof of Theorem 1.6.
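As a numeric companion to the size analysis, the sketch below computes the balancing parameter from $\ln|C|$. It assumes that the choice of $\epsilon$ amounts to $\epsilon = |C|^{-1/(d 2^d + 1)}$, which is what the identity $\epsilon|C| = |C|^{1 - 1/(d 2^d + 1)}$ used in Case 1 gives; the function name `epsilon_for` is hypothetical and `d` stands for the paper's $d = VC(C) + 2$.

```python
import math

def epsilon_for(log_size, d):
    """Balancing parameter eps = |C|^(-1/(D+1)), with D = d * 2^d and log_size = ln|C|."""
    D = d * 2 ** d
    return math.exp(-log_size / (D + 1))

d = 2
D = d * 2 ** d                     # D = 8
log_size = 90.0                    # ln|C|
eps = epsilon_for(log_size, d)

# Both sides of the balance, in logarithms:
lhs = math.log(eps) + log_size     # ln(eps * |C|)
rhs = D * math.log(1 / eps)        # ln((1/eps)^D)
```

With this choice the Case 1 bound $\epsilon|C|$ and the Case 2 bound $(1/\epsilon)^{d 2^d}$ coincide, which is the sense in which the parameter balances the two cases.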
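A Double sampling

Here we provide our version of the double sampling argument from [8] that upper bounds
the sample complexity of PAC learning for classes of constant VC-dimension. We use the
following simple general lemma.

Lemma A.1. Let $(\Omega, \mathcal{F}, \mu)$ and $(\Omega', \mathcal{F}', \mu')$ be countable¹² probability spaces. Let

$$F_1, F_2, F_3, \ldots \in \mathcal{F}, \qquad F'_1, F'_2, F'_3, \ldots \in \mathcal{F}'$$

be so that $\mu'(F'_i) \ge 1/2$ for all $i$. Then

$$(\mu \times \mu')\Big(\bigcup_i F_i \times F'_i\Big) \ge \frac{1}{2}\, \mu\Big(\bigcup_i F_i\Big),$$

where $\mu \times \mu'$ is the product measure.

Proof. Let $F = \bigcup_i F_i$. For every $\omega \in F$, let $F'(\omega) = \bigcup_{i : \omega \in F_i} F'_i$. As there exists $i$ such
that $\omega \in F_i$, it holds that $F'_i \subseteq F'(\omega)$ and hence $\mu'(F'(\omega)) \ge 1/2$. Thus,

$$(\mu \times \mu')\Big(\bigcup_i F_i \times F'_i\Big) = \sum_{\omega \in F} \mu(\{\omega\}) \cdot \mu'(F'(\omega)) \ge \sum_{\omega \in F} \mu(\{\omega\})/2 = \mu(F)/2.$$

□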
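The statement of Lemma A.1 can be checked exhaustively on small finite spaces. The sketch below is only an illustration with made-up spaces and events: it computes the product measure of the union of rectangles $F_i \times F'_i$ and half the measure of the union of the $F_i$, using exact fractions. The function name `lemma_a1_sides` is hypothetical.

```python
from fractions import Fraction
from itertools import product

def lemma_a1_sides(mu, mu_prime, pairs):
    """Return (product measure of the union of F_i x F'_i, mu(union of F_i) / 2).

    mu, mu_prime: dicts outcome -> Fraction (point masses)
    pairs:        list of (F_i, F'_i) given as sets of outcomes
    """
    union_rect = set()
    for F, F_prime in pairs:
        union_rect |= set(product(F, F_prime))
    lhs = sum(mu[w] * mu_prime[wp] for w, wp in union_rect)
    union_F = set().union(*(F for F, _ in pairs))
    rhs = sum(mu[w] for w in union_F) / 2
    return lhs, rhs

mu = {w: Fraction(1, 3) for w in range(3)}          # uniform on {0, 1, 2}
mu_prime = {w: Fraction(1, 4) for w in range(4)}    # uniform on {0, 1, 2, 3}
# Each F'_i has measure exactly 1/2, as the lemma requires.
pairs = [({0, 1}, {0, 1}), ({1, 2}, {2, 3})]
lhs, rhs = lemma_a1_sides(mu, mu_prime, pairs)
```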

We now give a proof of Theorem 1.2. To ease the reading we repeat the statement of 
the theorem. 

Theorem. Let $X$ be a set and $C \subseteq \{0,1\}^X$ be a concept class of VC-dimension $d$. Let $\mu$ be
a distribution over $X$. Let $\epsilon, \delta > 0$ and $m$ an integer satisfying $2(2m + 1)^d (1 - \epsilon/4)^m < \delta$.
Let $c \in C$ and $Y = (x_1, \ldots, x_m)$ be a multiset of $m$ independent samples from $\mu$. Then,
the probability that there is $c' \in C$ so that $c|_Y = c'|_Y$ but $\mu(\{x : c(x) \ne c'(x)\}) > \epsilon$ is at
most $\delta$.
Proof of Theorem 1.2. Let $Y' = (x'_1, \ldots, x'_m)$ be another $m$ independent samples from $\mu$,
chosen independently of $Y$. Let

$$H = \{h \in C : \mathrm{dist}_\mu(h, c) > \epsilon\}.$$

12 A similar statement holds in general. 
For $h \in C$, define the event

$$F_h = \{Y : c|_Y = h|_Y\},$$

and let $F = \bigcup_{h \in H} F_h$. Our goal is thus to upper bound $\Pr(F)$. For that, we also define
the independent event

$$F'_h = \{Y' : \mathrm{dist}_{Y'}(h, c) > \epsilon/2\}.$$

We first claim that $\Pr(F'_h) \ge 1/2$ for all $h \in H$. This follows from Chernoff's bound, but
even Chebyshev's inequality suffices: For every $i \in [m]$, let $V_i$ be the indicator variable
of the event $h(x'_i) \ne c(x'_i)$ (i.e., $V_i = 1$ if and only if $h(x'_i) \ne c(x'_i)$). The event $F'_h$ is
equivalent to $V = \sum_i V_i / m > \epsilon/2$. Since $h \in H$, we have $p := E[V] > \epsilon$. Since elements
of $Y'$ are chosen independently, it follows that $\mathrm{Var}(V) = p(1-p)/m$. Thus, the probability
of the complement of $F'_h$ satisfies

$$\Pr\big((F'_h)^c\big) \le \Pr\big(|V - p| \ge p - \epsilon/2\big) \le \frac{p(1-p)}{(p - \epsilon/2)^2 m} \le \frac{4}{\epsilon m} \le 1/2.$$
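The Chebyshev step can be checked numerically. Since $p \ge \epsilon$ gives $p - \epsilon/2 \ge p/2$, the Chebyshev bound $p(1-p)/((p-\epsilon/2)^2 m)$ is at most $4(1-p)/(pm) \le 4/(\epsilon m)$, which is at most $1/2$ once $m \ge 8/\epsilon$. The sketch below evaluates this on a grid; the requirement $m \ge 8/\epsilon$ is an assumption made explicit here, and the function name `chebyshev_tail` is hypothetical.

```python
def chebyshev_tail(p, eps, m):
    """Chebyshev bound on Pr(|V - p| >= p - eps/2) when Var(V) = p(1-p)/m."""
    return p * (1 - p) / ((p - eps / 2) ** 2 * m)

eps = 0.2
m = 40                                   # m = 8 / eps
# Grid of means p in [eps, 1].
grid = [eps + k * (1 - eps) / 100 for k in range(101)]
worst = max(chebyshev_tail(p, eps, m) for p in grid)
```

On this grid the largest value is attained at $p = \epsilon$, where the bound equals $4(1-\epsilon)/(\epsilon m)$.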

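We now give an upper bound on $\Pr(F)$. We note that

$$\Pr(F) \le 2 \Pr\Big(\bigcup_{h \in H} F_h \times F'_h\Big). \qquad \text{(Lemma A.1)}$$

Let $S = Y \cup Y'$, where the union is as multisets. Conditioned on the value of $S$, the
multiset $Y$ is a uniform subset of half of the elements of $S$. Thus,

$$\begin{aligned}
2 \Pr\Big(\bigcup_{h \in H} F_h \times F'_h\Big)
&= 2\, \mathbb{E}_S\Big[\mathbb{E}\big[\mathbf{1}\{\exists h \in H : h|_Y = c|_Y,\ \mathrm{dist}_{Y'}(h, c) > \epsilon/2\} \,\big|\, S\big]\Big] \\
&= 2\, \mathbb{E}_S\Big[\mathbb{E}\big[\mathbf{1}\{\exists h' \in H|_S : h'|_Y = c|_Y,\ \mathrm{dist}_{Y'}(h', c) > \epsilon/2\} \,\big|\, S\big]\Big] \\
&\le 2\, \mathbb{E}_S\Big[\sum_{h' \in H|_S} \mathbb{E}\big[\mathbf{1}\{h'|_Y = c|_Y,\ \mathrm{dist}_{Y'}(h', c) > \epsilon/2\} \,\big|\, S\big]\Big] && \text{(by the union bound)}
\end{aligned}$$

Notice that if $\mathrm{dist}_{Y'}(h', c) > \epsilon/2$ then $\mathrm{dist}_S(h', c) > \epsilon/4$, hence the probability that we
choose $Y$ such that $h'|_Y = c|_Y$ is at most $(1 - \epsilon/4)^m$. Using Theorem 1.1 we get

$$\Pr(F) \le 2\, \mathbb{E}_S\Big[\sum_{h' \in H|_S} (1 - \epsilon/4)^m\Big] \le 2(2m + 1)^d (1 - \epsilon/4)^m$$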